
Collaborating Authors

 Soccer


How Qatar Became FIFA's Technology Test Lab

WIRED

Qatar has become the place where FIFA experiments with the next generation of football technology. The results are already visible across this year's World Cup. To casual soccer viewers, the game may look like it always has: the same green field, 22 players, a referee, and the familiar rhythm of play unfolding over 90 minutes. The changes are only visible if you look beneath the familiar surface. What appears to be a traditional match is now supported by layers of tracking systems, automated analysis, and real-time data that run quietly in the background.


The history of brilliantly terrible World Cup video games

The Guardian

Seekers then come and try to find the hiders and, this being an online video game, shoot them. It's frantic, silly and fiendishly creative: finding a spot on one of the maps that you feel confident to paint yourself into - whether it's a laundry room or a farm outbuilding - is a challenging artistic and perceptual task, as well as a neat game mechanic. Meccha Chameleon perfectly encapsulates two popular and interconnected indie genres - prop games (hide and seek, but people disguise themselves as everyday objects) and the slightly pejoratively named "friendslop" (accessible, crudely designed multiplayer titles). So no wonder it has sold 7m units in less than a month.


NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Neural Information Processing Systems

Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained visual encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.


Scaling to Long Videos

Neural Information Processing Systems

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video, and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1× speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames). Code and models are available at https://github.com/NVlabs/Long-RL


Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval

Neural Information Processing Systems

The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs' reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Furthermore, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by experiments across multiple benchmarks and tasks.


d6d26053b977f8c589669fd201615119-Paper-Conference.pdf

Neural Information Processing Systems

Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy called LOGRA that leverages the gradient structure in backpropagation. We then provide a theoretical motivation of gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LOGIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LOGRA achieves competitive accuracy against more expensive baselines while showing up to 6,500×/5× improvements in compute/memory efficiency in influence computations as well as 2× speed-up in gradient statistics logging when applied to Llama3-8B-Instruct and the 1B-token subset of the OpenWebText dataset.


Humanoid robots just got a workplace safety system

FOX News

NVIDIA introduces Halos for Robotics, which the company calls the industry's first full-stack safety system for robotics and physical AI operating near people.


Conformal Linguistic Calibration: Trading-off between Factuality and Specificity

Neural Information Processing Systems

Language model outputs are not always reliable, thus prompting research into how to adapt model responses based on uncertainty. Common approaches include: abstention, where models refrain from generating responses when uncertain; and linguistic calibration, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unified view, Conformal Linguistic Calibration (CLC), which reinterprets linguistic calibration as answer set prediction. First we present a framework connecting abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation of CLC that allows for controlling the level of imprecision in model responses. Results demonstrate our method produces calibrated outputs with conformal guarantees on factual accuracy. Further, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.


World Cup Scams Are Getting Harder to Spot

WIRED

From fake tickets to cloned websites, AI is magnifying World Cup scams. Can fans distinguish between what's real and what's not? You got a World Cup ticket. It arrived in your inbox with a QR code, professional branding, and a confirmation email that looked like the real thing. For years, spotting a scam was relatively simple.


'Law-Breaking Country': Iran Soccer Federation Escalates Tensions With U.S.

TIME - Tech
